FM-index on GPU : a cooperative scheme to reduce memory footprint
The FM-index is a data structure that is seeing increasingly pervasive use, particularly in high-throughput bioinformatics. Algorithms based on it show a pseudo-random memory access pattern. As a consequence, they are usually bound by memory bandwidth rather than by CPU usage. Naive GPU implementations are no exception. Here we show that combining a compact design of the FM-index with a thread-cooperative approach can restore a proper balance. The resulting solution is less memory-bandwidth intensive, and allows full exploitation of the computational resources of the GPU across several GPU architectures.
Thread-cooperative, bit-parallel computation of Levenshtein distance on GPU
Approximate string matching is a very important problem in computational biology; it requires the fast computation of string distance as one of its essential components. Myers' bit-parallel algorithm improves on the classical dynamic programming approach to Levenshtein distance computation, and offers competitive performance on CPUs. The main challenge when designing an efficient GPU implementation is to expose enough SIMD parallelism while keeping a relatively small working set for each thread. In this work we implement and optimise a CUDA version of Myers' algorithm suitable for use as a building block for DNA sequence alignment. We achieve high efficiency by means of a cooperative parallelisation strategy for (1) very-long integer addition and shift operations, and (2) several simultaneous pattern matching tasks. In addition, we explore the performance impact of using features specific to the Kepler architecture. Our results show an overall performance of the order of tera cell updates per second using a single high-end Nvidia GPU, and speedups in excess of 20× with respect to a sixteen-core, non-vectorised CPU implementation.
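As a rough illustration of the bit-parallel idea mentioned in the abstract above, the following Python sketch computes Levenshtein distance with the bit-vector recurrence usually attributed to Myers (in Hyyrö's formulation for global distance). It is a single-threaded CPU sketch under stated assumptions, not the authors' CUDA implementation: Python's arbitrary-precision integers stand in for the very-long integer addition and shift operations that the paper distributes across cooperating threads, and the function names are hypothetical. A classical dynamic programming version is included only for cross-checking.

```python
def myers_levenshtein(pattern, text):
    """Levenshtein distance via a bit-parallel recurrence (illustrative sketch)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # keeps every bit vector m bits wide
    high = 1 << (m - 1)          # bit for the last pattern position
    # peq[c]: bit i is set when pattern[i] == c
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    vp, vn = mask, 0             # vertical +1 / -1 deltas of the current column
    score = m                    # distance between pattern and the empty text
    for ch in text:
        eq = peq.get(ch, 0)
        # The addition below is the "very-long integer addition" step:
        # its carry chain spans the whole pattern length.
        d0 = ((((eq & vp) + vp) ^ vp) | eq | vn) & mask
        hp = (vn | ~(d0 | vp)) & mask   # horizontal +1 deltas
        hn = d0 & vp                    # horizontal -1 deltas
        if hp & high:
            score += 1
        elif hn & high:
            score -= 1
        hp = ((hp << 1) | 1) & mask     # the 1 shifted in models D[0][j] = j
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return score


def levenshtein_dp(a, b):
    """Classical O(len(a) * len(b)) dynamic programming, for cross-checking."""
    prev = list(range(len(a) + 1))
    for j, cb in enumerate(b, 1):
        cur = [j]
        for i, ca in enumerate(a, 1):
            cur.append(min(prev[i] + 1, cur[i - 1] + 1, prev[i - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

Each text character updates a whole column of the dynamic programming matrix with a constant number of word-level operations, which is where the speedup over the cell-by-cell version comes from. For patterns longer than a machine word, the bit vectors span several words, and the carry of the addition has to be propagated between them.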
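Optimisation of a bioinformatics sequence alignment application running on many-core processors (GPUs) [original title: Optimització d'una aplicació bioinformàtica d'aliniament de seqüències executada en processadors many-core (GPUs)]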
Genomic sequence analysis tools allow biologists to identify and understand fundamental regions that are implicated in genetic diseases. There is currently a need to provide the scientific community with efficient analysis tools. This project characterises and analyses the performance of algorithms used to compare complete genomic sequences, executed on multi-core and many-core architectures. Based on this analysis, the suitability of this kind of architecture for the genomic sequence comparison problem is evaluated. Finally, a series of modifications to the implementations of these algorithms is proposed, with the aim of improving performance.
Boosting the FM-index on the GPU : effective techniques to mitigate random memory access
The recent advent of high-throughput sequencing machines producing large amounts of short reads has boosted interest in efficient string searching techniques. As of today, many mainstream sequence alignment software tools rely on a special data structure, called the FM-index, which allows for fast exact searches in large genomic references. However, such searches translate into a pseudo-random memory access pattern, thus making memory access the limiting factor of all computation-efficient implementations, both on CPUs and GPUs. Here we show that several strategies can be put in place to remove the memory bottleneck on the GPU: more compact indexes can be implemented by having more threads work cooperatively on larger memory blocks, and a k-step FM-index can be used to further reduce the number of memory accesses. The combination of those and other optimisations yields an implementation that is able to process about 2 Gbases of queries per second on our test platform, about 8× faster than a comparable multi-core CPU version, and about 3× to 5× faster than the FM-index implementation on the GPU provided by the recently announced Nvidia NVBIO bioinformatics library.
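To make the memory access pattern described above more concrete, here is a minimal Python sketch of exact pattern counting with an FM-index (backward search). It is a toy under stated assumptions, not the paper's GPU implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed for every position, `$` is assumed to be a sentinel absent from the text and smaller than every other symbol, and the function names are hypothetical. Compact block-based layouts and the k-step variant mentioned in the abstract are not shown.

```python
def build_fm_index(text):
    """Build a toy FM-index: C table, full occurrence table, and BWT length."""
    text += "$"  # sentinel (assumed absent from text, sorts before all symbols)
    # Naive suffix array; fine for a demo, far too slow for genomic references.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (text[-1] is the sentinel).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c]: number of symbols in the text strictly smaller than c
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact occurrences of pattern by backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open suffix array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Two table lookups at positions lo and hi. These positions jump
        # around the index from one step to the next, which is the source
        # of the pseudo-random memory access pattern.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, after `index = build_fm_index("GATTACAGATTACA")`, calling `count_occurrences(index, "GATTACA")` counts the two copies of the query in the reference. Each query symbol costs two lookups into the occurrence table, so reducing the size of that table or the number of steps directly reduces memory traffic.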